
[SPARK-34346][CORE][SQL][3.0] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31492

Closed
yaooqinn wants to merge 1 commit into apache:branch-3.0 from yaooqinn:SPARK-34346-30

Conversation


@yaooqinn yaooqinn commented Feb 5, 2021

Backport #31460 to 3.0

What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users simply share the `hive-site.xml` used for their Hive jobs and copy it to `SPARK_HOME/conf` without modification. When Spark generates Hadoop configurations, it uses `spark.buffer.size` (default 65536) to override `io.file.buffer.size` (Hadoop default 4096). But when `hive-site.xml` is loaded afterwards, this setting is ignored and `io.file.buffer.size` is reset again according to `hive-site.xml`.

  1. The configuration priority for applying Hadoop and Hive settings here is wrong; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.

  2. This breaks the `spark.buffer.size` config's ability to tune IO performance with HDFS whenever `hive-site.xml` contains an existing `io.file.buffer.size` entry.
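The intended precedence can be sketched as a layered merge, applied from lowest to highest priority (a hypothetical simulation for illustration only, not Spark's actual implementation; the layer names and values follow the description above):

```python
# Hypothetical simulation of the configuration layering described above.
# Layers are applied from lowest to highest priority, so later layers win:
# hadoop defaults < hive-site.xml < spark.hadoop.* < spark.hive.* < spark.*
def merge_layers(layers):
    """Merge dicts so that later (higher-priority) layers overwrite earlier ones."""
    conf = {}
    for layer in layers:
        conf.update(layer)
    return conf

hadoop_defaults = {"io.file.buffer.size": "4096"}     # Hadoop default
hive_site       = {"io.file.buffer.size": "131072"}   # copied hive-site.xml (example value)
spark_settings  = {"io.file.buffer.size": "65536"}    # derived from spark.buffer.size

# Correct order: Spark settings applied last, so spark.buffer.size wins.
correct = merge_layers([hadoop_defaults, hive_site, spark_settings])
assert correct["io.file.buffer.size"] == "65536"

# Buggy order before this fix: hive-site.xml loaded last, clobbering spark.buffer.size.
buggy = merge_layers([hadoop_defaults, spark_settings, hive_site])
assert buggy["io.file.buffer.size"] == "131072"
```

The fix amounts to ensuring the hive-site.xml layer is applied before, not after, the Spark-derived settings.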

Why are the changes needed?

Bugfix for the configuration loading behavior; it also fixes the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

Yes. This PR restores behavior that was silently changed for users: `spark.buffer.size` once again takes precedence over `io.file.buffer.size` from `hive-site.xml`.

How was this patch tested?

New tests.


Closes #31460 from yaooqinn/SPARK-34346.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

yaooqinn commented Feb 5, 2021

cc @cloud-fan @maropu @HyukjinKwon @dongjoon-hyun thanks


SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39521/


SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39521/


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @yaooqinn .
Merged to branch-3.0.

dongjoon-hyun pushed a commit that referenced this pull request Feb 5, 2021

Closes #31492 from yaooqinn/SPARK-34346-30.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

SparkQA commented Feb 5, 2021

Test build #134938 has finished for PR 31492 at commit 1157fd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
